Support raw HSACO custom kernargs by AWoloszyn · Pull Request #12 · ROCm/hrx-system

AWoloszyn · 2026-05-28T17:53:33Z

Preserve native kernarg layouts for raw HSACO kernels so prepacked HIP argument buffers from libraries such as hipBLASLt can dispatch correctly through the AMDGPU HAL.

stellaraccident · 2026-05-28T23:56:28Z

CI caught an error.

Co-authored-by: Ben Vanik <ben.vanik@gmail.com>

Hoist raw HSACO kernel iteration into executable creation so the resolver only writes a single host kernel-args record.

Keep implicit-argument layout metadata in the dispatch kernarg layout instead of duplicating it in the hot kernel-args table.

benvanik

Second-pass review notes. I agree this is moving in a better direction overall, but I think there are still a few contract issues that need tightening before this is safe to merge. Most of them come from one pressure point: the raw/native kernarg model is starting to share helpers with the regular HAL/flatbuffer model, and those two projections need to stay separate.

1. Raw kernarg reflection leaks into flatbuffer executable loading

The new default HSACO reflection projection computes constant_count from the full aligned native kernarg segment:

hsaco_metadata.c:L1706-L1725

That helper is still used by the flatbuffer executable path, which validates it against the flatbuffer ExportDef.constant_count:

executable.c:L1535-L1544

For a normal wrapped executable with two 64-bit bindings and two 32-bit constants, the native kernarg segment is 24 bytes, so this projects constant_count == 6. The flatbuffer ABI still says there are 2 constants. That means the flatbuffer path can reject valid wrapped executables because the raw-HSACO projection is now pretending padding and binding bytes are HAL constants.

The same leak shows up in parameter offsets. The public HAL contract says binding parameter offsets are binding ordinals:

executable.h:L123-L131

But populate_default_export_parameters now writes raw kernarg byte offsets for bindings:

hsaco_metadata.c:L1779-L1786

Suggested fix: split this into two explicit projections instead of trying to make one "default" helper mean both things.

Flatbuffer/HAL projection: compact visible metadata into HAL ABI reflection, with binding offsets as binding ordinals and constants as compact byte offsets/counts. This is what the wrapped executable path validates against ExportDef.
Raw HSACO/native projection: preserve native kernarg offsets and expose enough byte coverage for custom-direct prepacked argument blobs. This should be used only by the raw HSACO executable path.

That split also gives the names room to state the invariant. Something like calculate_hal_export_parameter_requirements versus calculate_native_kernarg_export_parameter_requirements would make accidental cross-use much harder.

2. Custom-direct validation accepts truly short argument buffers for the common no-implicit-suffix case

iree_hal_amdgpu_executable_initialize_dispatch_descriptor leaves custom_kernarg_layout as all-zero unless an implicit suffix is detected:

executable.c:L1142-L1157

Both planners then use custom_kernarg_layout.explicit_kernarg_size as the required minimum:

When there is no implicit suffix, that required size is zero. So the relaxed validation accepts any custom direct blob length, including one that is actually shorter than the declared kernel kernarg segment. The writer then uses the caller's length as the dynamic layout size:

dispatch.c:L99-L113

That is looser than the intended "allow missing trailing ABI padding" rule. It can accept a genuinely short prepacked buffer and dispatch with missing argument bytes.

Suggested fix: even for dynamic custom-direct layouts, preserve a validated minimum explicit byte count from metadata/symbol information. The dispatch-time rule should be roughly:

iree_host_align(constants.data_length, 8) >= layout->explicit_kernarg_size

Then the writer may clamp extra trailing bytes and zero missing padding up to the reservation. The dynamic part can still mean "the exact reservation may be caller-sized when no implicit suffix exists", but the minimum legal byte count should not be zero unless the kernel really has a zero-byte kernarg ABI.

3. Host-queue custom-direct block count truncates before the safety check

The host queue path computes a runtime block count from constants.data_length and immediately casts it to uint32_t:

host_queue_dispatch.c:L245-L250

The ring-capacity check later sees only the truncated value:

host_queue_dispatch.c:L304-L310

But the actual emplace path still memset/memcpys the full constants.data_length:

dispatch.c:L99-L113

So an oversized iree_host_size_t length can wrap to a small block count, pass validation, reserve too few queue kernarg blocks, and then overrun the queue-owned kernarg allocation. This is security/correctness-sensitive because the direct host-queue path is not protected by the command-buffer qword length guards.

Suggested fix: keep the ceil-div result in iree_host_size_t, reject values greater than UINT32_MAX, then cast. It would be worth sharing a tiny checked helper with executable descriptor initialization so every path that converts kernarg byte length to block count has the same overflow behavior.

4. ELF symbol augmentation can read past malformed section bounds

The new synthetic kernel augmentation scans ELF symbol sections, but the appended-symbol pass checks only that symbol_section_offset <= elf_data.data_length:

hsaco_metadata.c:L1413-L1421

It does not check symbol_section_size <= elf_data.data_length - symbol_section_offset. The first thing iree_hal_amdgpu_hsaco_metadata_is_elf_kernel_symbol does is read symbol[4]:

hsaco_metadata.c:L1304-L1314

The second pass repeats the same unchecked range pattern:

hsaco_metadata.c:L1503-L1511

So malformed ELF section metadata can produce out-of-bounds symbol reads before the later symbol-name helper gets a chance to reject the symbol.

Suggested fix: use the full range check used elsewhere in this file before deriving symbol_count, including multiplication/addition overflow checks for symbol_count * symbol_entry_size. If the section is malformed, skip it or fail loudly, but the code should never form a symbol pointer outside the ELF byte span.

5. `append_elf_kernel_symbols` leaks expansion allocations on failure

After the expansion storage is allocated:

hsaco_metadata.c:L1456-L1464

a failure allocating extra_symbol_name_storage leaks expanded.kernels/args:

hsaco_metadata.c:L1465-L1470

Later IREE_RETURN_IF_ERROR exits during the second scan can leak both allocations, because the expanded buffers have not yet been installed into metadata and the caller's cleanup only sees the original metadata object:

Suggested fix: make the expansion path a status-gated block with one terminal cleanup section. Until ownership is committed into metadata, expanded.kernels and expanded.owned_string_storage need local cleanup on every error path.

Larger design concern: synthetic ELF kernels need to be quarantined/opted in

This helper now invents metadata for ELF function symbols that do not have metadata entries by picking one template kernel and copying its ABI/workgroup metadata:

That is a very strong producer-specific assumption. It may match a Tensile/hipBLASLt packaging pattern, but it is not generally true for arbitrary raw HSACO: a code object can contain multiple kernels with different argument layouts, required workgroup sizes, group/private segment sizes, or hidden suffix behavior. Publishing synthetic exports with copied metadata makes those kernels look valid until dispatch, where the failure mode is wrong argument layout rather than a clean load-time error.

Suggested fix: quarantine this behind an explicit opt-in. A few possible shapes:

Put the synthetic-kernel expansion behind a raw-HSACO executable flag/option that names the producer assumption, instead of doing it unconditionally in iree_hal_amdgpu_hsaco_metadata_initialize_from_elf.
Restrict it to a narrow, validated pattern: exactly one metadata template kernel, extra ELF kernel symbols known to share the same ABI by producer contract, and ideally an executable parameter that says HRX is intentionally loading that bundle form.
Keep the generic parser honest: by default, only kernels with actual AMDGPU metadata should become dispatchable exports. Missing metadata should mean "not reflectable/not dispatchable by this generic path", not "borrow an arbitrary template".

If this expansion stays, tests should include at least one negative case with two real metadata kernels of different ABI plus one extra symbol, so the loader proves it does not silently clone the wrong contract.

benvanik · 2026-06-02T16:49:23Z

(my prompt to gpt5.5-xhigh was: › github PR review time! https://github.com/ROCm/hrx-system/pull/12/ - I've already done a first pass and think it looks better, but I'd like you to see if there's any other risks in here (I found some performance ones in our critical dispatch path, and safety is always an issue with ensuring proper iree_status_t tracking/etc) - I recommend doing the same - the concerns identified above are all valid)

Support raw HSACO custom kernargs

71d66f5

Preserve native kernarg layouts for raw HSACO kernels so prepacked HIP argument buffers from libraries such as hipBLASLt can dispatch correctly through the AMDGPU HAL.

AWoloszyn added 2 commits June 1, 2026 05:43

Fix out of date test.

b7a8705

Some fixes so that raw metadata is more correct

85dba33

AWoloszyn marked this pull request as ready for review June 1, 2026 12:56

benvanik requested changes Jun 1, 2026

View reviewed changes

AWoloszyn and others added 8 commits June 1, 2026 13:36

Update runtime/src/iree/hal/drivers/amdgpu/executable.c

c019729

Co-authored-by: Ben Vanik <ben.vanik@gmail.com>

[amdgpu] Remove unused parameters.

e5b76d6

Refactor raw HSACO kernel arg resolution

fb1fd23

Hoist raw HSACO kernel iteration into executable creation so the resolver only writes a single host kernel-args record.

Remove duplicated AMDGPU implicit args offset

80c803b

Keep implicit-argument layout metadata in the dispatch kernarg layout instead of duplicating it in the hot kernel-args table.

[amdgpu] removed misleading function.

8bab76a

[amdgpu] clean up odd return.

7cd89b5

[amdgpu] Remove duplicated code in implicit arg handling.

7feed69

[amdgpu] Only copy bytes ones, and cache layout data.

cd925b3

benvanik self-requested a review June 2, 2026 16:34

benvanik reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support raw HSACO custom kernargs#12

Support raw HSACO custom kernargs#12
AWoloszyn wants to merge 11 commits into
mainfrom
users/awoloszyn/amdgpu-raw-hsaco-custom-kernargs

AWoloszyn commented May 28, 2026 •

edited by stellaraccident

Loading

Uh oh!

stellaraccident commented May 28, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

benvanik left a comment

Uh oh!

benvanik commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

AWoloszyn commented May 28, 2026 • edited by stellaraccident Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stellaraccident commented May 28, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

benvanik left a comment

Choose a reason for hiding this comment

1. Raw kernarg reflection leaks into flatbuffer executable loading

2. Custom-direct validation accepts truly short argument buffers for the common no-implicit-suffix case

3. Host-queue custom-direct block count truncates before the safety check

4. ELF symbol augmentation can read past malformed section bounds

5. append_elf_kernel_symbols leaks expansion allocations on failure

Larger design concern: synthetic ELF kernels need to be quarantined/opted in

Uh oh!

benvanik commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

AWoloszyn commented May 28, 2026 •

edited by stellaraccident

Loading

5. `append_elf_kernel_symbols` leaks expansion allocations on failure